Abstract

This is the final report for EOARD project #033060 “Speaker verification
using a dynamic, ‘articulatory’ segmental hidden Markov model”.

A segmental HMM is an HMM whose states are associated with sequences
of acoustic feature vectors rather than individual vectors. This report
describes the results of experiments in which such a model is applied to
text-dependent and text-independent speaker detection on the YOHO and
Switchboard corpora, respectively. Text-dependent speaker verification
results on YOHO using a simple segmental HMM show a 44% reduction in false
acceptances compared with a conventional HMM. A type of ‘segmental GMM’
is then described for text-independent speaker detection. In order to
apply this model to the NIST 2003 single-speaker test set, various techniques
are developed to reduce its computational load. We then report a range of
experiments that investigate the utility of different aspects of this model
for text-independent speaker detection. From these experiments we have
been unable to demonstrate a benefit, in terms of text-independent
speaker-detection accuracy, from the use of dynamic segment models
corresponding to linear trajectories with non-zero slope. Consequently we
have also been unable to demonstrate any benefit from the use of longer
segments. Thus there is little evidence from these experiments that
non-stationary sections of a speech signal contain important individual
differences which can be exploited for speaker detection. If this is true,
it goes some way towards explaining the success of GMM-based approaches.
We conclude that further work is needed to determine definitively the
contribution of non-stationary segments to speaker detection.
1 Introduction


This is the final report for EOARD project #033060 “Speaker verification using
a dynamic, ‘articulatory’ segmental hidden Markov model”, which started on
1st October 2003. The report describes technical progress which has been made,
discusses the speaker verification results, and outlines possible future work.

Some recent work in speech recognition conducted as part of the ‘Balthasar’
project¹ (Russell and Jackson 2005) at the University of Birmingham has
resulted in a class of novel, multiple-level Segmental Hidden Markov Models
(M-SHMMs) in which the relationship between symbolic and acoustic
representations of a speech signal is regulated by an intermediate
‘articulatory’ layer (figure 1). Each state of the model is associated with
variable-duration trajectories in the ‘articulatory’ space, which are mapped
into the acoustic space using one or more ‘articulatory-to-acoustic’ mappings.
Comparison with unknown speech data, for the purposes of probability
calculations, takes place in the acoustic space. A similar approach has been
studied by Deng and Ma (2000).


[Figure 1 diagram. Layers, from top to bottom: acoustic layer (e.g., MFCCs);
articulatory-to-acoustic mapping; intermediate layer; finite-state process.]

Figure 1: A segmental model that uses linear trajectories in an intermediate space.

Such an approach has many potential advantages for speech pattern processing.
For example, in acoustic representations of speech (derived from short-term
log-power spectra) articulator dynamics are manifested indirectly, often as
movement between, rather than within, frequency bands. Intuitively, therefore,
it would be much better to model dynamics directly, in an articulatory-based
representation. Also, by incorporating an articulatory representation (or at
least one which is more closely related to an articulatory representation than
conventional spectrum-based acoustic representations), it may be possible to
characterise the production strategies that give rise to variability in fluent,
conversational speech. Thus it was hoped that such a model would improve speech
recognizer performance by modelling the underlying mechanisms that cause
variability, rather than relying solely on generic statistical modelling
techniques.

¹ The ‘Balthasar’ project was funded by EPSRC grant GR/M87146 “An
integrated multiple-level statistical model for speech pattern processing” (see
http://web.bham.ac.uk/p.jackson/balthasar).
This approach should also have benefits for speaker detection, and the goal
of this project is to apply M-SHMMs to that problem. The benefits should fall
into two categories:

•  Those  which  derive  from  the  incorporation  of  an  explicit  ‘articulatory- 
related’  representation  into  a  statistical  model,  and 

•  Those  which  derive  from  the  improved  modelling  of  speech  dynamics  and 
duration  which  results  from  the  use  of  a  segmental  framework. 

In the first category, inter-speaker differences which result from physiological
factors, such as the differences between an adult’s vocal tract and that of a
child, should be represented explicitly in the articulatory layer rather than
indirectly through their acoustic correlates. The model should also enable
individual differences in the articulatory strategies used by a speaker during
speech production to be exposed and modelled explicitly. Furthermore, provided
that the articulatory-based representation is sufficiently compact, there
should also be significant advantages for speaker adaptation from limited
amounts of data, since less data will be needed to train the smaller number of
parameters. As an illustration, in (Russell and Jackson 2005) it is shown that
a triphone M-SHMM system with an intermediate representation based on just 3
formant frequencies can achieve better phone classification results on TIMIT,
while at the same time having 25% fewer parameters than the conventional
system. Of course, in order to realise this benefit fully it will be necessary
to extend speaker adaptation techniques, such as MAP (Gauvain and Lee 1994) or
MLLR (Leggetter and Woodland 1995), to the articulatory layer of an M-SHMM.
This is currently being studied in a separate PhD project.

The second type of potential benefit derives from improved modelling of speech
dynamics and duration. The model should be able to capture individual
differences in non-stationary speech segments which might otherwise be swamped
by the large variance that results from the HMM piecewise stationarity
assumption. Thus it is plausible that such a model will improve our
understanding of inter-speaker differences, and hence improve speaker detection
performance, by modelling some of the underlying mechanisms that give rise to
intra- and inter-speaker differences. It would also be possible to determine
whether non-stationary speech segments are any more or less useful for speaker
detection than stationary segments.

The speaker-detection experiments described in this report focus on the second
set of factors. In other words, we apply simple linear-trajectory segmental
HMMs, in which the intermediate representation is absent, to speaker detection.
These M-SHMMs are equivalent to the ‘Fixed Trajectory’ segmental HMMs
described in (Holmes and Russell 1999). In practice, this type of segmental
HMM is realised in the ‘SEGVit’ software toolkit by setting the intermediate
space equal to the acoustic space, and by setting the ‘articulatory-to-acoustic’
mapping to be the identity mapping I.
The report is organised as follows. In section 2 we present some of the
relevant background results on the application of M-SHMMs to phone recognition
from the earlier ‘Balthasar’ project. Section 3 is a brief description of
relevant aspects of the theory of M-SHMMs. The main part of the report presents
results of two sets of experiments, namely text-dependent speaker detection
experiments on the YOHO corpus (section 4), and text-independent
speaker-detection experiments on a subset of the 2003 NIST speaker recognition
evaluation test set (section 5). The first set of experiments were also
reported at the 2004 ‘Odyssey’ Speaker Recognition Workshop in Toledo, Spain,
in June 2004 (Liu, Russell, and Carey 2004). Our conclusions, and suggestions
for further work, are set out in section 8.

2  Relevant  results  from  the  ‘Balthasar’  Project 

In  this  section  we  review  some  of  the  relevant  results  on  general  M-SHMMs  from 
the  phone  classification  experiments  reported  in  (Russell  and  Jackson  2005). 

A  potential  problem  with  the  type  of  multiple-level  model  described  in  the 
previous  section  is  that  any  advantages  which  are  gained  by  the  introduction  of 
an  intermediate  layer  may  be  compromised  by  inadequacies  of  the  articulatory 
representation  or  articulatory-to-acoustic  mapping,  or  theoretical  compromises 
made  for  mathematical  or  computational  tractability. 

In (Russell and Jackson 2005) M-SHMMs were studied in which the intermediate
representation is based on the control parameter set for the
Holmes-Mattingly-Shearme (HMS) parallel formant synthesiser (Holmes, Mattingly,
and Shearme 1964). Three different formant-based intermediate
parameterisations were considered:

•  3FF - the first three formant frequencies, F1, F2 and F3. It is clear
that these three parameters alone do not contain sufficient information
to reconstruct a short-term spectrum (or MFCC vector) unambiguously.
For example, there is no data concerning formant amplitudes.

•  3FF+5BE  -  the  first  three  formant  frequencies  plus  5  band  energies 

•  12PFS - the complete set of 12 Holmes-Mattingly-Shearme parallel formant
synthesiser control parameters (Holmes, Mattingly, and Shearme 1964).
Experiments conducted by Holmes in the early 1970s demonstrated that these
parameters, if chosen correctly, are sufficient to synthesise
natural-sounding speech (Holmes 1973).

In all of these experiments speech dynamics were modelled using linear
trajectories, and the articulatory-to-acoustic mapping was realised as a set of
one or more linear mappings (Russell and Jackson 2005).

It is easy to see that a linear ‘articulatory-to-acoustic’ mapping is not
sufficient for speech pattern modelling (Richards and Bridle 1999). For example,
consider the case where speech is represented in the acoustic domain as the
output of a set of $D_2$ uniformly-spaced band-pass filters spanning
frequencies up to 4 kHz, and $f$ is a hypothetical ‘formant’ trajectory, with
unit amplitude, whose frequency increases linearly from 100 Hz to 4 kHz. The
corresponding trajectory in acoustic space is a complex path over the surface
of the $D_2$-dimensional unit sphere, which passes through each of the axes in
turn. Such a trajectory clearly cannot be realised as the image of $f$ under a
single linear mapping. However, previous experience has shown that even
relatively small deviations from the conventional HMM framework can result in
significant difficulties and poor performance. Therefore it was judged that a
proper understanding of the issues which arise in the implementation of a
system with linear transformations is essential before attempting to deal with
more complex non-linear systems.

The  key  results  reported  in  (Russell  and  Jackson  2005)  are: 

•  There  is  a  theoretical  upper  bound  on  the  performance  of  a  linear  M- 
SHMM,  which  is  better  than  that  obtained  with  a  comparable  conventional 
HMM 

•  This  upper  bound  can  be  attained  by  appropriate  choice  of  ‘articulatory’ 
representation  and  articulatory-to-acoustic  mappings 

•  There  is  a  trade-off  between  the  dimension  of  the  ‘articulatory-based’  space 
and  the  number  of  different  mappings  which  make  up  the  piecewise-linear 
‘articulatory-to-acoustic’  mapping.  For  example,  optimal  performance  can 
be  achieved  by  using  all  12  HMS  synthesiser  control  parameters  and  a 
single  (phone-independent)  linear  mapping,  or  by  using  fewer  parameters 
but  more,  phone-dependent,  mappings  (Russell  and  Jackson  2005) 

The significance of this result, in general, is that it provides a solid
theoretical foundation for the development of richer classes of multi-level
models, which include non-linear models of dynamics, alternative articulatory
representations, sets of non-linear articulatory-to-acoustic mappings, and
integrated optimisation schemes that support unsupervised learning of the
trajectory, intermediate representation and mapping parameters. Moreover, these
speech recognition results also motivate the application of M-SHMMs to speaker
detection.

3  Overview  of  the  theory  of  M-SHMMs 

The purpose of this section is to explain the basic theory of multiple-level,
trajectory-based segmental HMMs (M-SHMMs) and the simpler fixed trajectory
segmental HMMs used in our experiments. Full details are presented in
(Russell and Jackson 2005) and (Holmes and Russell 1999). Readers who are
familiar with these papers may skip this section.
3.1 Definitions


As explained earlier, an M-SHMM is a particular type of segmental hidden Markov
model (SHMM) (Ostendorf, Digalakis, and Kimball 1996). In other words, the
states of an M-SHMM are associated with sequences of feature vectors, or
segments, rather than individual vectors. The model is called ‘multiple-level’
because it considers two levels of representation of a speech signal: a
$D_1$-dimensional ‘articulatory’ space $\mathcal{I}$ and a $D_2$-dimensional
acoustic space $\mathcal{A}$. In (Russell and Jackson 2005) the ‘articulatory’
and acoustic spaces are based on formants and Mel-Frequency Cepstral
Coefficients (MFCCs), respectively.

3.2  The  multiple-level,  linear-trajectory  segment  model 

A state $\sigma_i$ of an M-SHMM is identified with a variable-duration linear
trajectory in $\mathcal{I}$ which is mapped into $\mathcal{A}$ by a linear
‘articulatory-to-acoustic’ mapping. A state is parameterised by two
$D_1$-dimensional (articulatory) vectors, namely the mid-point vector $c_i$ and
slope vector $m_i$, a $D_2 \times D_2$ (acoustic) covariance matrix $V_i$, and
a linear ‘articulatory-to-acoustic’ mapping
$W_i : \mathcal{I} \rightarrow \mathcal{A}$. A trajectory $f_i$ of length
$\tau$ is defined by:

$$f_i(t) = (t - \bar{t})\, m_i + c_i \tag{1}$$

where $\bar{t} = (\tau + 1)/2$, and the function of $W_i$ is to map this
‘articulatory’ trajectory in $\mathcal{I}$ into the acoustic space
$\mathcal{A}$. If $Y_1^\tau = [y_1, y_2, \ldots, y_\tau]$ is a sequence of
acoustic feature vectors in $\mathcal{A}$, then the probability (density) of
$Y_1^\tau$ given state $\sigma_i$ is given by:

$$p(Y_1^\tau \mid \sigma_i) = b_i(Y_1^\tau) = d_i(\tau) \prod_{t=1}^{\tau} \mathcal{N}\bigl(y_t;\, W_i(f_i(t)),\, V_i\bigr), \tag{2}$$

where $d_i(\tau)$ is the probability that state $\sigma_i$ emits a segment of
length $\tau$, and $\mathcal{N}(y_t; W_i(f_i(t)), V_i)$ is a $D_2$-dimensional
Gaussian probability density function (PDF) with mean $W_i(f_i(t))$ and
covariance matrix $V_i$ (it is assumed that $V_i$ is diagonal).

In the special case where $\mathcal{I} = \mathcal{A}$ and $W_i$ is the identity
matrix, this reduces to a Fixed Trajectory Segmental HMM (Holmes and Russell
1999) and equation 2 becomes:

$$p(Y_1^\tau \mid \sigma_i) = b_i(Y_1^\tau) = d_i(\tau) \prod_{t=1}^{\tau} \mathcal{N}\bigl(y_t;\, f_i(t),\, V_i\bigr). \tag{3}$$
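As an illustration of Equation 3, the following is a minimal sketch of the
log-likelihood of one segment under a fixed-trajectory state with diagonal
covariance. It is not the SEGVit implementation; the function name, the
argument layout, and the choice to pass log d(τ) in as a ready-made number are
all assumptions made for the example.

```python
import math

def segment_log_likelihood(Y, c, m, var, log_dur_prob=0.0):
    """Log of Equation 3 for a single segment (illustrative sketch).

    Y            : list of tau acoustic vectors y_1..y_tau (each a list of floats)
    c, m         : mid-point and slope vectors of the state
    var          : diagonal of the covariance matrix V (assumed diagonal)
    log_dur_prob : log d(tau); assumed to be supplied by the caller
    """
    tau = len(Y)
    t_bar = (tau + 1) / 2.0                      # segment mid-point, as in Eq. 1
    total = log_dur_prob
    for t, y in enumerate(Y, start=1):           # frames are indexed 1..tau
        for d in range(len(c)):
            mean = (t - t_bar) * m[d] + c[d]     # f(t) = (t - t_bar) m + c
            total += -0.5 * (math.log(2 * math.pi * var[d])
                             + (y[d] - mean) ** 2 / var[d])
    return total
```

Setting the slope vector to zero makes every frame share the mean c, which is
the stationary case assumed by a conventional single-Gaussian HMM state.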


3.3  The  segmental  Viterbi  decoder 

Let $\mathcal{M}$ be an $S$-state M-SHMM (for simplicity, it is assumed that
the probability of a transition from state $\sigma_i$ to state $\sigma_j$ is
zero unless $j > i$). Suppose that the acoustic sequence
$Y_1^T = [y_1, \ldots, y_T]$ corresponds to several states/segments. Then
$\mathcal{M}$ can only explain $Y$ via a state sequence
$x = [x_1, \ldots, x_T]$, which can be written in the form
$x = [d_1 \otimes z(1), \ldots, d_L \otimes z(L)]$. For each
$l \in \{1, \ldots, L\}$, $z(l) = \sigma_i$ for some $i \in \{1, \ldots, S\}$,
and $d_l \otimes z(l)$ represents a duration $d_l$ spent in state $z(l)$.
Thus, the joint density has the form

$$p(Y, x \mid \mathcal{M}) = \pi_{z(1)}\, b_{z(1)}\bigl(Y_1^{t_2 - 1}\bigr) \prod_{l=2}^{L} a_{z(l-1),z(l)}\, b_{z(l)}\bigl(Y_{t_l}^{t_{l+1} - 1}\bigr), \tag{4}$$

where $\pi_{z(1)}$ is the probability that the state sequence begins in state
$z(1)$; $b_{z(l)}$ denotes the acoustic segment PDF associated with state
$z(l)$; $a_{z(l-1),z(l)}$ denotes the transition probability from $z(l-1)$ to
$z(l)$; $t_l$ is the time at which the state sequence $x$ enters state $z(l)$,
and $t_{L+1} = T + 1$.²

A simple extension of the segmental Viterbi decoder (see, for example,
Holmes and Russell 1999) can be used to compute the optimal state sequence
$\hat{x}$ for a given sequence of acoustic vectors $Y_1^T$ and model
$\mathcal{M}$, such that

$$\hat{p}(Y \mid \mathcal{M}) = p(Y, \hat{x} \mid \mathcal{M}) = \max_{x}\, p(Y, x \mid \mathcal{M}). \tag{5}$$

For completeness, a brief description of the segmental Viterbi decoder is
included. By analogy with the notation for the forward probability used in the
case of a conventional HMM (see, for example, Holmes and Holmes 2001), let

$$\hat{\alpha}_j(t) = \max_{x_1, \ldots, x_{t-1}} p(y_1, \ldots, y_t;\; x_t = \sigma_j,\; x_{t+1} \neq \sigma_j). \tag{6}$$


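The final condition, $x_{t+1} \neq \sigma_j$, is included to ensure that only
segments which are complete at time $t$ are considered. Then, it can be shown
that

$$\hat{\alpha}_j(t) = \max_{\tau} \max_{i} \begin{cases} \pi_j\, b_j\bigl(Y_1^{t}\bigr) & \text{for } t = \tau \\ \hat{\alpha}_i(t - \tau)\, a_{ij}\, b_j\bigl(Y_{t-\tau+1}^{t}\bigr) & \text{for } t > \tau \end{cases} \tag{7}$$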


where $1 \le \tau \le \tau_{\max}$ and $\pi_j$ is the probability that the
state sequence begins in the $j$th state. The requirement in Equation 7 to
optimise over all possible segment durations, $\tau$, and to evaluate segmental
state output probabilities, $b_j(Y_{t-\tau+1}^{t})$, leads to a substantial
increase in computational load relative to the normal Viterbi decoder. As in
conventional Viterbi decoding for continuous speech recognition, this algorithm
is applied to a single, integrated M-SHMM in which the individual word- or
phone-level M-SHMMs are connected according to a grammar (Bridle et al. 1983).
Thus, in the case of phone recognition and a bigram language model, the result
of decoding is the sequence of phones $[p_1, \ldots, p_\Phi]$ and phone
boundaries $[t_1, \ldots, t_\Phi]$ such that the joint probability,

$$p(Y; t_1, \ldots, t_\Phi; p_1, \ldots, p_\Phi) = p\bigl(Y_1^{t_1} \mid p_1\bigr) \prod_{\phi=2}^{\Phi} \Bigl( P(p_\phi \mid p_{\phi-1})^{\lambda_1}\; p\bigl(Y_{t_{\phi-1}+1}^{t_\phi} \mid p_\phi\bigr)\, \lambda_2 \Bigr), \tag{8}$$

² The introduction of the symbol $z(l)$ to denote a state simplifies subsequent
notation, in particular Eq. 4. In the symbol $x_t$, the time index $t$ is in
synchrony with the observation sequence $y_t$; whereas for $z(l)$, the index
$l$ is in synchrony with the state transitions. Unlike in a conventional HMM,
the two are not generally the same in an SHMM.
is maximised. Here, $\lambda_1$ is the Language Model Scale Factor (LMSF) and
$\lambda_2$ is the Token Insertion Penalty (TIP).

Our current software uses a single implementation of the segmental Viterbi
decoder for embedded and non-embedded training, phone classification and
phone recognition. This is achieved by introducing a time-indexed array of
breakpoints that specify, at each time $t$, whether a phone boundary is
obligatory, possible or illegal. An additional parameter, $\tau_{\max}$,
specifies the maximum permissible segment length.
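The recursion of Equation 7 can be sketched in the log domain as follows. This
is a toy illustration under stated assumptions, not the SEGVit decoder: the
function name and interface are hypothetical, the segment score is supplied as
a callable `seg_log_prob(j, start, end)` returning log b_j for frames
start..end (1-based, inclusive), and breakpoint constraints, language-model
terms and pruning are all omitted. Self-transitions are excluded by the caller
giving them log probability minus infinity.

```python
NEG_INF = float("-inf")

def segmental_viterbi(T, n_states, log_pi, log_a, seg_log_prob, tau_max):
    """Return (best log score, [(state, start, end), ...]) for T frames."""
    # alpha[t][j]: best log score of a path whose last segment is in state j
    # and ends exactly at frame t (the 'segment complete at t' condition).
    alpha = [[NEG_INF] * n_states for _ in range(T + 1)]
    back = [[None] * n_states for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in range(n_states):
            for tau in range(1, min(tau_max, t) + 1):
                b = seg_log_prob(j, t - tau + 1, t)
                if tau == t:
                    # First segment of the utterance: pi_j * b_j(Y_1^t).
                    candidates = [(log_pi[j] + b, None)]
                else:
                    # Extend the best path ending at t - tau in some state i.
                    candidates = [(alpha[t - tau][i] + log_a[i][j] + b, i)
                                  for i in range(n_states)]
                for score, i in candidates:
                    if score > alpha[t][j]:
                        alpha[t][j] = score
                        back[t][j] = (tau, i)
    j = max(range(n_states), key=lambda s: alpha[T][s])
    best = alpha[T][j]
    if best == NEG_INF:
        raise ValueError("no valid segmentation under the given constraints")
    # Trace back the segment boundaries and state labels.
    segments, t = [], T
    while t > 0:
        tau, i = back[t][j]
        segments.append((j, t - tau + 1, t))
        t, j = t - tau, i
    return best, segments[::-1]
```

The three nested loops over t, j and τ (plus the inner loop over predecessor
states) make the extra cost relative to frame-synchronous Viterbi decoding
explicit: each frame requires up to `tau_max` segment evaluations per state.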


4  Text-Dependent  Speaker  Verification 


The  most  straightforward  application  of  M-SHMMs  to  speaker  recognition  is 
text-dependent  speaker  verification  (TD-SV).  This  is  because  a  conventional 
TD-SV  system  typically  uses  phone-level  or  word-level  HMMs,  which  can  simply 
be  replaced  by  the  corresponding  M-SHMMs. 

Suppose that a sequence of acoustic feature vectors $Y = [y_1, \ldots, y_N]$ is
claimed to result from subject $S$ speaking a text $Q$. The decision whether to
accept or reject this claim is based on the likelihood ratio:

$$L(S) = \frac{p(Y \mid S, Q)}{p(Y \mid Q)} \tag{9}$$

where $p(Y \mid S, Q)$ is computed using a set of word- or phone-level models
for speaker $S$, configured to represent the text $Q$, and $p(Y \mid Q)$ is
calculated using a set of speaker-independent models configured to represent
$Q$.

Our experiment used the YOHO (Higgins 1990) and TIMIT (Garofolo et al.
1993) speech corpora. As this was an initial exploration of the application of
M-SHMMs to speaker recognition, we considered a Fixed Trajectory Segmental
HMM, in which there is no intermediate ‘articulatory-based’ representation
(i.e. $\mathcal{I} = \mathcal{A}$ and $W_i$ is the identity mapping). Thus the
experiment focusses on the utility of improved modelling of duration and
dynamics for speaker recognition (and not on the utility of introducing an
intermediate, ‘articulatory-based’ representation).


4.1  Experimental  Method 

4.1.1  The  TIMIT  and  YOHO  speech  corpora 

The YOHO corpus comprises recordings of 138 subjects speaking connected
digit-sequence phrases in an office environment. It was chosen because of its
established use in text-dependent speaker verification (Higgins 1990). The
speech in the YOHO corpus is sampled at 8 kHz.

The TIMIT corpus is very well known, and comprises recordings of read
speech. TIMIT is labelled at the phone level, and is therefore particularly
useful for building phone-level acoustic models. Speech in the training
component of the TIMIT corpus was downsampled to an 8 kHz sampling rate, for
compatibility with YOHO.




4.1.2  Acoustic  parameterisation 

All  of  the  data  was  parameterised  using  the  Hidden  Markov  Model  Tool  Kit 
(HTK)  tool  ‘HCopy’  (Young,  Odell,  Ollason,  Valtchev,  and  Woodland  1997). 
Each  file  is  represented  as  a  sequence  of  13  dimensional  feature  vectors,  one 
every  10ms,  comprising  MFCCs  1  to  12  plus  energy. 

No $\Delta$ or $\Delta^2$ parameters were used in any of the experiments. In
fact, we have not yet used $\Delta$ or $\Delta^2$ parameters in any of our
previous M-SHMM based speech recognition experiments. This is because part of
the motivation for the development of M-SHMMs is to obtain a better model of
speech dynamics and thereby obviate the need for these parameters.

In a conventional HMM, the assumptions that the underlying structure of
a speech segment is stationary, and that the static, $\Delta$ and $\Delta^2$
parameters are non-zero, are clearly inconsistent. A trajectory-based model
could overcome this inconsistency: for such a model to incorporate non-zero
$\Delta$ parameters, linear trajectories would be needed, while one which
included non-zero $\Delta$ and $\Delta^2$ would need quadratic trajectories.
The issues raised by including dynamic features in a conventional HMM are
discussed in (Bridle 2004).

4.1.3  Construction  of  initial  acoustic  models  using  TIMIT 

The TIMIT training data set was used to estimate the initial parameters for
matching sets of context-sensitive triphone HMMs and M-SHMMs. The HMMs
and the M-SHMMs were built using the SEGVit M-SHMM software toolkit
developed at the University of Birmingham. In both cases, monophone models
with three emitting states were constructed first, and then used to seed a set
of triphone models. The triphone model set was defined using a simple ‘backoff’
strategy whereby a triphone model was constructed if and only if 30 or more
examples of that triphone context occurred in the training data; otherwise the
triphone was replaced by a biphone (if 30 or more examples of the biphone
context occurred in the training data) or a monophone. This is the 1400
triphone model set from (Russell and Jackson 2005). In the case of M-SHMMs,
the maximum segment duration $\tau_{\max}$ was set to 15 and the duration
probability mass functions $d_i$ were non-parametric (the Ferguson duration
model (Ferguson 1980)).
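The ‘backoff’ rule described above can be sketched as follows. This is only an
illustration: the function name is hypothetical, the report does not say which
biphone context (left or right) is used, so a left-context biphone is assumed
here, and the `l-p+r` style of model name is chosen purely for readability.

```python
from collections import Counter

def choose_model_units(contexts, min_count=30):
    """Map each observed (left, phone, right) context to a model name.

    contexts  : list of (left, phone, right) tuples, one per training example
    min_count : threshold below which a context is backed off (30 in the text)
    """
    tri_counts = Counter(contexts)
    # ASSUMPTION: the biphone keeps the left context only.
    bi_counts = Counter((left, phone) for left, phone, _ in contexts)
    units = {}
    for (left, phone, right), n in tri_counts.items():
        if n >= min_count:
            units[(left, phone, right)] = f"{left}-{phone}+{right}"  # triphone
        elif bi_counts[(left, phone)] >= min_count:
            units[(left, phone, right)] = f"{left}-{phone}"          # biphone
        else:
            units[(left, phone, right)] = phone                      # monophone
    return units
```

The number of distinct values in the returned mapping is the size of the
resulting model set.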

46  of  these  triphones  were  needed  to  model  the  102  cross-word  triphones  in 
the  YOHO  corpus. 

The states of the conventional HMMs are associated with single Gaussian
densities. This is for compatibility with the M-SHMM system, which currently
cannot accommodate multiple-component Gaussian mixture densities. The
conventional monophone HMMs were initialised and re-estimated using the HTK
tools ‘HInit’ and ‘HRest’ respectively (Young, Odell, Ollason, Valtchev, and
Woodland 1997). These conventional monophone HMMs were also used to seed
the monophone M-SHMMs, by setting the M-SHMM state mean and variance
vectors equal to the corresponding HMM state mean and variance vectors, and
setting the M-SHMM state slope vectors equal to zero. The ‘self-loop’
state-transition probabilities were set to zero in the case of the M-SHMMs,
but were non-zero for the conventional HMMs.

4.1.4 Construction of background and speaker-dependent models using YOHO

Models for those triphones which occur in the YOHO data were used to seed
speaker-independent sets of YOHO HMMs and M-SHMMs, which were trained
on all of the data from 20 of the subjects (10 female, 10 male) in the YOHO
corpus. These models formed the HMM and M-SHMM Background Models
(BMs). The HMM and M-SHMM BMs were each trained using 20 iterations of
Baum-Welch (HTK) and Viterbi-based (SEGVit) training respectively.

The remaining 118 subjects were used as test subjects. For each of these
subjects, 96 files were used to train speaker-dependent HMMs and M-SHMMs
(speaker-dependent models, SDMs). As with the BMs, the HMM and M-SHMM SDMs
were trained using 20 iterations of Baum-Welch and Viterbi-based training,
respectively. The remaining 20 files were split into 5 test sets, each
containing 4 speech files. A single trial consisted of comparing one such test
set with a speaker-dependent model and BM. Thus, for each system, the number of
‘authorised user’ trials is 118 x 5 = 590, and the number of ‘impostor’ trials
is 118 x 117 x 5 = 69030.

4.2  Results  of  text-dependent  speaker  detection  experiments  on 
YOHO 

The results of the text-dependent speaker verification experiments are shown as
DET curves in Appendix A (figure 4). The lower bound of 0.17% in the figure
for the false rejection probability equates to a single rejection out of the
590 ‘authorised user’ trials. It is likely that this results from incorrectly
labelled data. Because of this small number of errors there is no opportunity
to compare the HMM and M-SHMM systems in terms of false rejection rates on this
data set. Both systems achieve an optimal false rejection rate of 0.5%.

The false acceptance rates for the HMM and M-SHMM systems provide a
more useful comparison. At the optimal points these are 0.52% for the HMM
system and 0.29% for the M-SHMM system, corresponding to 359 and 200 false
acceptances, respectively. This equates to a 44% reduction in the number of
false acceptances by using the M-SHMM system, relative to the conventional
HMM-based system.
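The trial counts and the relative reduction quoted above follow directly from
the figures in the text, as the short check below shows. The variable names are
illustrative only.

```python
subjects = 118          # test subjects
test_sets = 5           # test sets per subject (4 files each)

target_trials = subjects * test_sets                      # 'authorised user' trials
impostor_trials = subjects * (subjects - 1) * test_sets   # every other subject as impostor

# False acceptance counts implied by the quoted rates.
fa_hmm = round(0.0052 * impostor_trials)
fa_mshmm = round(0.0029 * impostor_trials)

# Relative reduction in false acceptances, and the smallest non-zero
# false rejection rate that the test can resolve (one error).
relative_reduction = (fa_hmm - fa_mshmm) / fa_hmm
min_false_rejection = 1 / target_trials

print(target_trials, impostor_trials, fa_hmm, fa_mshmm)
print(f"{100 * relative_reduction:.1f}% reduction, "
      f"{100 * min_false_rejection:.2f}% resolution")
```

The resolution of about 0.17% on false rejections is why the comparison between
the two systems rests on false acceptances.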

4.3  Summary  of  Text-Dependent  Verification  Results 

In summary, there is some evidence from this experiment that an M-SHMM-based
text-dependent speaker verification system can outperform a conventional
HMM-based system. This is illustrated by the reduction in false acceptance
errors. However, particularly in the case of false rejection errors, the
resolution of this test is not sufficiently fine to draw clear conclusions.
Therefore it was decided that a more difficult speaker-detection task should be
attempted, namely text-independent speaker detection on the Switchboard corpus.


5  Text-Independent  Speaker  Verification 


5.1  A  ‘segmental  GMM’ 


Although many different approaches to text-independent speaker detection have
been tried, the most successful approach to date is undoubtedly probabilistic
classification using Gaussian Mixture Models (GMMs) (Reynolds 1992). As
in the text-dependent case, to test the hypothesis that a sequence of acoustic
feature vectors $Y = [y_1, y_2, \ldots, y_N]$ was spoken by a talker $S$, the
likelihood ratio

$$L(S) = \frac{p(Y \mid S)}{p(Y)} \tag{10}$$

is computed and compared with a pre-determined threshold $T$. The probability
$p(Y)$ is computed using a ‘Background Model’ (BM) or ‘General Speaker Model’
(GSM), which is a GMM trained on acoustic feature vectors corresponding to
speech produced by a large population of talkers. The value of $p(Y \mid S)$ is
computed using a ‘speaker model’ for speaker $S$, which is a GMM trained on
acoustic feature vectors derived from speech produced by $S$ (or, more
normally, adapted from the BM). The quantity $L(S)$ in equation (10) is an
approximation to the posterior probability of $S$ given the data $Y$, where the
prior probability $P(S)$ of speaker $S$ is ignored. The score $L(S)$ is often
normalised to allow the same threshold to be used for all talkers
(Auckenthaler, Carey, and Lloyd-Thomas 2000).

In order to compare conventional methods with an M-SHMM-based method
for text-independent speaker detection, it is therefore natural to attempt to
construct a segmental HMM version of a conventional GMM-based speaker
recognition system.

In  a  GMM-based  system: 


•  A speech signal is treated as a sequence $Y = [y_1, y_2, \ldots, y_N]$ of
independent acoustic feature vectors,

•  $p(Y)$ is computed as a product of probabilities $p(y_t)$,
$p(Y) = \prod_{t=1}^{N} p(y_t)$, and

•  Each $p(y_t)$ is evaluated using a weighted sum of multivariate Gaussian
PDFs defined on the acoustic feature space.


By  analogy,  in  our  ‘segmental  GMM’: 


• Y will be treated as a sequence of K independent segments
$Y = [Y_{1}^{t_1}, Y_{t_1+1}^{t_2}, \ldots, Y_{t_{K-1}+1}^{N}]$ (where K depends on Y),

• p(Y) is computed as a product of probabilities $p(Y_{t_{k-1}+1}^{t_k})$,
$p(Y) = \prod_{k=1}^{K} p(Y_{t_{k-1}+1}^{t_k})$, where $t_0 = 0$ and $t_K = N$, and

• Each $p(Y_{t_{k-1}+1}^{t_k})$ is evaluated using a trajectory-based segment model.

Since the number of segment boundary points K and the values of the
boundary points $t_1, t_2, \ldots, t_K$ are not known in advance, they must be
calculated during the speaker-detection process using the segmental Viterbi decoder
from section 3.3. By employing a segmental variant of the forward-backward
algorithm for conventional HMMs, it would be possible to calculate p(Y) by
summing over all possible values of K and segmentations $t_1, t_2, \ldots, t_K$, and for
an individual segment $[t_{k-1}+1, t_k]$ to calculate $p(Y_{t_{k-1}+1}^{t_k})$ by summing over
all segment models. However, in the present study this was discounted on
computational grounds, and also for the practical reason that it would
necessitate substantial development of additional software within the ‘SEGVit’
toolkit. Instead we use the segmental Viterbi decoder to find the optimal value
of K and segmentation $t_1, t_2, \ldots, t_K$, and for each segment $[t_{k-1}+1, t_k]$ we
define $p(Y_{t_{k-1}+1}^{t_k}) = \max_{\sigma} p(Y_{t_{k-1}+1}^{t_k} | \sigma)$, where $\sigma$ ranges over all possible segment
models.

In  terms  of  a  conventional  GMM,  this  is  analogous  to  computing  the  acoustic 
vector  probability  p(yt)  by 


$$p(y_t) = \max_{m=1,\ldots,M} p_m(y_t) \qquad (11)$$

rather than by

$$p(y_t) = \sum_{m=1}^{M} p_m(y_t) \qquad (12)$$


i.e.  by  choosing  the  best  Gaussian  component  in  the  GMM  instead  of  summing 
over  all  components.  For  consistency,  and  in  order  to  focus  on  the  ‘frame- 
based’  versus  ‘segment-based’  comparison  which  is  the  subject  of  this  research, 
we  use  equation  (11)  rather  than  (12)  in  all  of  our  ‘baseline’  GMM  experiments. 
Once  this  decision  has  been  made,  it  will  be  seen  that  a  conventional  GMM  is 
equivalent  to  a  ‘segmental  GMM’  in  which  the  maximum  segment  duration  Tmax 
is  set  to  1. 
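
As an illustration only, the difference between equations (11) and (12) can be sketched in Python for a diagonal-covariance GMM. The sketch assumes that each $p_m(y_t)$ already includes its mixture weight; the function names and the use of NumPy are ours and are not part of the ‘SEGVit’ toolkit.

```python
import numpy as np

def component_log_densities(y, weights, means, variances):
    """Return log(w_m * N(y; mu_m, diag(var_m))) for every component m."""
    diff = y - means                                   # shape (M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum(diff * diff / variances, axis=1)
    return np.log(weights) + log_norm + log_exp        # shape (M,)

def frame_log_prob_max(y, weights, means, variances):
    """Equation (11): keep only the best-scoring component."""
    return np.max(component_log_densities(y, weights, means, variances))

def frame_log_prob_sum(y, weights, means, variances):
    """Equation (12): sum over all components (log-sum-exp for stability)."""
    log_p = component_log_densities(y, weights, means, variances)
    peak = np.max(log_p)
    return peak + np.log(np.sum(np.exp(log_p - peak)))
```

For any frame the ‘max’ score is a lower bound on the ‘sum’ score, and the two differ by at most log M, which is why the approximation is usually regarded as mild when one component dominates.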


5.2  Construction  of  the  ‘segmental  GMM’ 

Intuitively, the most natural approach to the problem of applying M-SHMMs
to text-independent speaker verification is to replace the conventional GMM
with a single segmental HMM. The ‘segmental GMM’ consists of M states, each
associated with the type of variable-duration linear trajectory segment model
described in section 3, specified by mean, slope and variance vectors in the acoustic
space and a duration probability distribution. These states are configured in
parallel, with a single initial, non-emitting, ‘null’ state and a single non-emitting
final ‘null’ state (figure 2). The segmental states are analogous to the mixture
components in a conventional GMM system, while the transition probabilities $w_i$
from the initial null state to the ith emitting segmental state correspond to the
GMM component ‘mixture’ weights.

Figure 2: M-SHMM structure for text-independent speaker verification.

Given a sequence $Y = [y_1, y_2, \ldots, y_N]$ which is claimed to correspond to an
utterance spoken by speaker S, we compute the likelihood ratio:

$$L(S) = \frac{p(Y|S)}{p(Y)} \qquad (13)$$


where the speaker-dependent probability p(Y|S) is given by:

$$p(Y|S) = \max_{K} \max_{t_1, t_2, \ldots, t_K} \max_{\sigma_1^S, \ldots, \sigma_K^S} \prod_{k=1}^{K} \left( \left(w^{S}_{\sigma_k}\right)^{\lambda_1} p(Y_{t_{k-1}+1}^{t_k} | \sigma_k^S) \, \lambda_2 \right) \qquad (14)$$

In other words, for the speaker-dependent probability p(Y|S) the maximum
is taken over all possible numbers of segments K, all possible segmentations
$t_1, t_2, \ldots, t_K$ of length K, and all possible sequences $\sigma_1^S, \ldots, \sigma_K^S$ of length K
of states from the speaker-dependent model for speaker S. As before, $\lambda_1$ is the
Language Model Scale Factor (LMSF) and $\lambda_2$ is the Token Insertion Penalty
(TIP).
Similarly the BM probability p(Y) is given by:

$$p(Y) = \max_{J} \max_{t_1, t_2, \ldots, t_J} \max_{\sigma_1^B, \ldots, \sigma_J^B} \prod_{j=1}^{J} \left( \left(w^{B}_{\sigma_j}\right)^{\lambda_1} p(Y_{t_{j-1}+1}^{t_j} | \sigma_j^B) \, \lambda_2 \right) \qquad (15)$$

For the background probability p(Y) the maximum is taken over all possible
numbers of segments J, all possible segmentations $t_1, t_2, \ldots, t_J$ of length J, and
all possible sequences $\sigma_1^B, \ldots, \sigma_J^B$ of length J of states from the background
model. We use different letters (K and J) for the segment sequence lengths
in equations (14) and (15) to emphasise that, in general, both the number of
segments and the segment indices will be different for the speaker-dependent
and background-model probability calculations.
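
The maximisation over K, the boundaries and the state sequence can be carried out jointly by dynamic programming. The following Python sketch is an illustration under stated assumptions rather than the ‘SEGVit’ decoder: it works with log probabilities (not the toolkit's negative-log costs), it assumes a linear trajectory centred on the segment midpoint with diagonal variance, and it assumes that the duration probability is part of each segment score. All function and argument names are hypothetical.

```python
import numpy as np

def segment_log_prob(seg, mean, slope, var, log_dur):
    """Log-probability of frames `seg` (d x D) under one linear-trajectory state."""
    d = seg.shape[0]
    offsets = np.arange(d) - (d - 1) / 2.0         # time relative to segment midpoint
    traj = mean + offsets[:, None] * slope         # (d, D) linear trajectory
    ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (seg - traj) ** 2 / var)
    return ll + log_dur[d - 1]                     # add log P(duration = d)

def segmental_viterbi(Y, log_w, means, slopes, variances, log_durs,
                      t_max, lmsf=1.0, log_tip=0.0):
    """Return (best log score, [(start, end, state), ...]) for frames Y (N x D).

    lmsf scales the log weights (lambda_1); log_tip is the log of the
    probability-domain insertion penalty (lambda_2), added once per segment.
    """
    N, M = Y.shape[0], len(log_w)
    best = np.full(N + 1, -np.inf)                 # best[t]: best score for Y[:t]
    best[0] = 0.0
    back = [None] * (N + 1)
    for t in range(1, N + 1):
        for d in range(1, min(t_max, t) + 1):      # candidate segment Y[t-d:t]
            seg = Y[t - d:t]
            for m in range(M):
                score = (best[t - d] + lmsf * log_w[m] + log_tip
                         + segment_log_prob(seg, means[m], slopes[m],
                                            variances[m], log_durs[m]))
                if score > best[t]:
                    best[t], back[t] = score, (t - d, m)
    segments, t = [], N
    while t > 0:                                   # trace back the best path
        start, m = back[t]
        segments.append((start, t, m))
        t = start
    return best[N], segments[::-1]
```

Calling the function once with the speaker-dependent parameters and once with the background parameters gives the two terms of equation (13); as noted above, the two calls will in general return different segmentations. The triple loop also makes the cost visible: the work grows with the product of N, the maximum duration and M.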

5.2.1 The Language Model Scale Factor $\lambda_1$ and Token Insertion
Penalty $\lambda_2$

The effect of the LMSF $\lambda_1$ is to control the influence of the individual ‘mixture
weights’ $w^S$ and $w^B$ in equations (14) and (15). A large value of $\lambda_1$ will ‘sharpen’ the
distribution $w_1, w_2, \ldots, w_M$ and increase the influence of the weights.
Conversely, if $\lambda_1 = 0$ then the weights will have no effect at all. The TIP $\lambda_2$ is a
multiplicative penalty which is incurred each time a new segment is
hypothesised. An explanation of a sequence Y which involves K segments will incur a
penalty of $\lambda_2^K$. Thus setting $\lambda_2 = 1$ will have no effect, but setting $\lambda_2 > 1$ will
favour longer sequences and setting $\lambda_2 < 1$ will favour shorter sequences.

In the ‘SEGVit’ M-SHMM toolkit, all probability calculations are done in
the negative logarithmic domain (where maximising a probability is translated
into minimising a cost), and parameters such as the LMSF and TIP are specified
in the configuration file as values in that domain. In the negative logarithmic
domain $\lambda_1$ becomes a multiplicative factor and $\lambda_2$ becomes an additive penalty.
With respect to this domain, setting $\lambda_2 = 0$ will have no effect, but setting $\lambda_2 > 0$
will favour shorter segment sequences (and hence longer individual segments)
and setting $\lambda_2 < 0$ will favour longer sequences (and hence shorter individual
segments). Thus the TIP parameter $\lambda_2$ provides an external mechanism for
influencing segment lengths.
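
The correspondence between the two domains can be checked with a short Python sketch. It is illustrative only: the function names are ours, and the relation `tip_cost = -log(tip)` is our reading of the description above rather than a documented ‘SEGVit’ setting.

```python
import math

def path_score_prob(weights, seg_probs, lmsf, tip):
    """Probability domain: product over segments of (w ** lmsf) * p * tip."""
    score = 1.0
    for w, p in zip(weights, seg_probs):
        score *= (w ** lmsf) * p * tip
    return score

def path_cost_neglog(weights, seg_probs, lmsf, tip_cost):
    """Negative-log domain: sum over segments of lmsf * (-log w) - log p + tip_cost."""
    cost = 0.0
    for w, p in zip(weights, seg_probs):
        cost += lmsf * (-math.log(w)) + (-math.log(p)) + tip_cost
    return cost
```

With `tip_cost = -math.log(tip)` the second function returns the negative logarithm of the first, so a probability-domain penalty below 1 appears as a positive additive cost per segment, which is why positive values favour explanations with fewer, longer segments.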

5.3  Switchboard  data  sets  used 

The 2002 and 2003 NIST SRE subsets of Switchboard were obtained through
NIST and LDC to enable us to evaluate the segmental GMM for speaker
detection on the NIST 2003 SRE test. The experiments use:

•  The  one-speaker  training  material  from  the  2002  NIST  SRE  to  train  the 
BM, 
• The one-speaker training data from the 2003 NIST SRE to train the speaker-dependent models (SDMs),
and 

•  A  subset  of  approximately  50%  of  the  one-speaker  test  data  from  the  2003 
NIST  SRE  as  test  data. 

An analysis of the systems used in the 2003 NIST SRE and the results obtained
suggests that a suitable parameterisation of the speech signal would comprise
mel frequency cepstral coefficients 1 to 18, plus energy, plus the corresponding
Δ parameters. However, in the present system only the static parameters were
used. This was partly to reduce the computational load, and partly because it
was hoped that explicit modelling of speech dynamics would remove the need
for the Δ parameters, as discussed earlier in section 4.1.1. The data was then
parameterised as 18 mel frequency cepstral coefficients (MFCCs) plus an energy
measure (C0) using the HTK ‘HCopy’ tool (footnote 3).

5.4  Training  procedure  for  the  ‘segmental’  GMM  Background 
Model 

An  analysis  of  published  results  for  conventional  GMM  systems  suggests  that 
an  appropriate  number  of  GMM  components  is  of  the  order  of  1024.  However, 
some  researchers  (for  example  Auckenthaler  and  Mason)  have  reported  good 
results  on  Switchboard  data  using  as  few  as  500  components.  In  the  case  of 
our  ‘segmental  GMM’,  the  time  taken  to  train  and  evaluate  a  model  with  1024 
segmental  components  would  preclude  an  extensive  investigation  of  the  effect  of 
different  M-SHMM  variants  and  parameters  on  speaker  recognition  performance. 
Hence,  for  the  current  experiments  the  number  of  segmental  components  in  the 
‘segmental  GMMs’  was  set  to  300  (M  =  300  in  figure  2). 

5.4.1  Factors  influencing  the  performance  of  a  ‘segmental  GMM’ 

The key parameters of the ‘segmental GMM’, whose effect on verification
performance we want to measure, are as follows:

• The maximum segment duration. The parameter Tmax specifies the
maximum allowable segment duration. If Tmax = 1 then states are
associated with individual feature vectors, and our ‘segmental GMM’ reduces to
a type of conventional GMM. As Tmax increases, the model becomes ‘more
segmental’ but the computational load increases. In our experiments on
Switchboard, values of 1, 5 and 10 were chosen for Tmax.

Footnote 3: At first the MFCC-based parameterisation which uses an explicit measure of energy
(MFC_E) was chosen. However, it was found that with this parameterisation HCopy gives incorrect
results (abnormally large positive or negative numbers) for some of the energy measure
parameters of Switchboard data. This problem does not occur if the zeroth MFCC coefficient
is used instead (MFC_0).
• The trajectory slope. This could be set to zero, estimated for the BM
from training data and then maintained at this value for each
speaker-dependent model, or reestimated for each speaker model. The significance
of the trajectory slope parameters is likely to depend on the Tmax
parameter, with slope being more significant for larger values of Tmax.

• The segment duration model. Again this could be trained from data
for the BM and either passed unchanged to each speaker-dependent model
or reestimated for each speaker-dependent model. Since duration is a
segment-level, rather than frame-level, parameter, very few training
examples of segment duration are likely to be contained in a typical
speaker-dependent adaptation or training set. Therefore accurate estimation of a
speaker-dependent duration model is likely to be an issue.

• The language model control parameters $\lambda_1$ and $\lambda_2$. As explained
previously, the SEGVit system includes two parameters, LMSF ($\lambda_1$) and
TIP ($\lambda_2$) (see section 5.2.1), which can be used to influence average segment
duration. If $\lambda_1$ and $\lambda_2$ take their default values of 1 and 0, respectively
(remember that these parameters operate in the negative log probability
domain), then they have no effect on the Viterbi decoder. However, setting
$\lambda_2 > 0$ will result in shorter state sequences and, hence, longer segments.
Conversely, if $\lambda_2 < 0$ longer state sequences and shorter segments are
preferred. Similarly, setting $\lambda_1 > 1$ will both sharpen the distribution of
mixture weights (and therefore increase their influence) and decrease their
magnitude (and therefore bias the decoder towards shorter state sequences
and longer segments). Conversely, choosing $0 < \lambda_1 < 1$ will ‘flatten’
the distribution of mixture weights (and therefore reduce their influence)
and increase their average value (and therefore bias the decoder towards
longer state sequences and shorter segments). Thus by adjusting these two
control parameters during training or testing, it is possible to influence the
durational structures of the segments in the BM and speaker-dependent
models.

5.4.2  ‘Segmental  GMM’  BM  construction  for  Switchboard 

As  part  of  our  previous  research  on  TIMIT  phone  classification  (Russell  and 
Jackson  2005),  we  have  developed  software  to  produce  sets  of  context-sensitive 
triphone  M-SHMMs  of  varying  sizes  (using  the  monophone  and  biphone  ‘backoff’ 
approach  described  earlier).  Using  this  software  we  have  developed  TIMIT-based 
model  sets  with  between  104  and  5,989  models  (or,  equivalently,  between  312 
and  17,967  states).  By  combining  all  or  a  subset  of  the  states  of  a  suitable  family 
of  models  into  a  single,  integrated  M-SHMM  of  the  type  depicted  in  figure  2  we 
hoped  that  we  could  obtain  a  suitable  initial  model  to  ‘seed’  Viterbi  reestimation 
of  our  segmental  BM  for  Switchboard.  Estimation  of  the  target  speaker  models 
could then proceed as previously described. For this pilot experiment we chose
the  maximum  segment  duration  Tmax  to  be  equal  to  5. 

Unfortunately this did not prove to be the case. The dissimilarity between
the TIMIT-based models and the Switchboard data was such that nearly 80%
of the M-SHMM states were not visited at all during reestimation. After two
iterations of the reestimation, only 20% of the M-SHMM states had non-zero
‘occupancy’ and could therefore be reestimated. Thus the effective number of
states was significantly reduced. We concluded that it is not possible to use
segmental states estimated on TIMIT as initial models for work on
Switchboard.

As an alternative we estimated the mean values of 300 segments using
k-means clustering applied to a randomly chosen subset of the Switchboard 2002
data. The initial segment trajectory slope values were set to zero and the state
duration distributions were set to be uniform.
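
A minimal sketch of this initialisation is given below, assuming plain Lloyd-style k-means on individual feature vectors. The report does not state how the clustering was implemented or how the initial variances were chosen, so the iteration count, the seeding and the dictionary layout are our assumptions, and variances are deliberately left out.

```python
import numpy as np

def kmeans(frames, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centres of shape (k, D)."""
    rng = np.random.default_rng(seed)
    centres = frames[rng.choice(len(frames), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every frame to every centre: shape (n, k).
        d2 = ((frames[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = frames[labels == j]
            if len(members):                     # leave empty clusters unchanged
                centres[j] = members.mean(axis=0)
    return centres

def initial_segment_models(frames, k, t_max):
    """Seed k segment models: k-means means, zero slopes, uniform durations."""
    means = kmeans(frames, k)
    return {
        "means": means,
        "slopes": np.zeros_like(means),
        "durations": np.full((k, t_max), 1.0 / t_max),
    }
```

In the experiments described here k would be 300 and `t_max` would be 1, 5 or 10; the resulting models are only a starting point for the Viterbi reestimation.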

These initial segment models were used to construct an initial ‘segmental
GMM’ Background Model, which was optimised using the Viterbi-based
M-SHMM reestimation functions in the ‘SEGVit’ software toolkit and the NIST
2002 SRE one-speaker training set. The segment trajectory means and variances
were reestimated first, using 4 iterations of Viterbi training. Then the segment
trajectory means, slopes and variances were reestimated for a further 5 iterations.
The duration probabilities were only reestimated in the final, 5th, iteration.

Different maximum segment lengths corresponding to Tmax = 1, 5 and 10
were chosen to make 3 sets of models, which we refer to as SW1, SW5, and
SW10. These models were built to test the effect of maximum segment duration
on speaker-detection performance. For all model sets except SW1, the segment
trajectory means, slopes and variances and the segment duration distributions
were estimated. In the case of SW1, only the segment trajectory means and
variances were reestimated; the trajectory slopes were set to 0 and the segment
duration can only be 1. The model SW1 was treated as the counterpart of the
traditional GMM system and used as our baseline system.

5.5  Training  procedure  for  the  speaker-dependent  ‘segmental 
GMMs’ 

Each trained BM from section 5.4.2 was used to reestimate a speaker-dependent
‘segmental GMM’ (SDM) for each of the test speakers in the 2003 Switchboard
test set. Data from the 2003 Switchboard training set was used to reestimate
these models.

For the BM set SW5 (Tmax = 5), three different sets of SDMs were produced:

• In the first set, SW5_0, the segment trajectory mean vectors were
reestimated but the slope vectors were set to zero in both the BM and SDMs.

• In the second set, SW5_1, only the segment trajectory mean values were
reestimated. The segment trajectory slopes in these models are therefore
the same as those of the corresponding segment models in the BM.

• In the third set, SW5_2, the segment trajectory slopes were also
reestimated, along with the segment trajectory means.

For  the  speaker-dependent  models  the  segment  duration  models  and  variance 
parameters  were  not  reestimated  because  of  the  limited  amount  of  training  data 
which  is  available  for  each  speaker.  The  trajectory  means  were  reestimated  in 
all  cases. 

The effects on performance of setting the trajectory slope values to zero in
both the BM and SDMs, reestimating them for the BM but not the SDMs, or
reestimating them for the BM and SDMs, were tested experimentally.

5.6 Initial experiments on the NIST 2003 single-speaker
evaluation set

Because  of  the  need  to  run  segmental  Viterbi  decoding  and  to  compute  segment- 
level probabilities, the computational load associated with our ‘segmental GMM’
is  significantly  greater  than  that  associated  with  a  conventional  GMM.  In  order 
to  reduce  this  computational  cost  and  to  improve  experimental  turn-around 
time,  speaker-detection  experiments  were  conducted  using  just  half  of  the  male 
test  speakers  (671  speakers)  and  half  of  the  female  test  speakers  (1042  speakers) 
from  the  NIST  2003  single-speaker  evaluation  set. 

As  specified  in  the  NIST  2003  evaluation  documentation,  for  each  test  file,  11 
different  verification  tests  were  performed.  This  in  turn  involved  12  probability 
calculations  -  one  for  the  background  model  and  11  for  the  speaker-dependent 
models. 

The  following  experiments  were  conducted: 

• Experiment 1: This experiment investigated the effects on performance
of setting the trajectory slope values to zero in both the BM and SDMs
(SW5_1), reestimating the trajectory slope vector for the BM but not for
the SDMs (so that the SDM trajectory slope vectors are equal to the
corresponding BM slope vectors, SW5_2), and reestimating the slope vector
for both the SDMs and the BM (SW5_3). In this experiment Tmax = 5.

• Experiment 2: The performances of the systems with maximum duration
Tmax set to 1 (SW1), 5 (SW5) and 10 (SW10) were compared. In these
experiments all of the BM trajectory parameters were reestimated and used
to seed the corresponding SDM parameters, and all of the SDM parameters
were then reestimated (except in the case of SW1, where the slope vectors
are all zero; this is the baseline system).

5.7 Speeding up experiment turn-around time

It has already been noted that the computational load associated with
M-SHMMs is an important issue (see section 3.3 and, in particular, the discussion
after equation (3.3)).

The time taken to train the BM, and the even longer time
required for testing, meant that it would not be possible to evaluate many different
variations of the ‘segmental GMM’ system. However, as we have no previous
experience of applying these models to Switchboard, many experiments need to
be conducted to derive optimal values for the maximum segment durations and
to establish the utility of the different trajectory parameters.

It has already been noted that only half of the test set from the NIST 2003
SRE was used, and this reduces the computation in testing by 50%. The number of
segmental states in the model was also kept low at 300. However, the
computational load was still prohibitive.

5.7.1  Parallelisation  of  the  SEGVit  toolkit 

The  ‘SEGVit’  software  toolkit  has  been  modified  so  that  model  training  can  be 
conducted in parallel on a ‘grid’ of computers. However, the computation time
is still prohibitively long for a large detection task. For example, we estimate
that an evaluation of our reduced system with Tmax = 15 will take between 20
and 25 days on our 6-node cluster.

5.7.2  Beam  pruning  and  duration  pruning 

Techniques which work for recognition, such as Beam Pruning, have been
extended to the ‘SEGVit’ toolkit during the period of this project, but they are
much less effective for speaker detection than for speech recognition. This is
because at present there is effectively no syntax to constrain possible segment
sequences. In other words, because each segment in the ‘segmental GMM’ can
be preceded by every other segment, pruning out paths in the past does not
alter the number of segments which have to be evaluated in the present. We also
developed a new technique which we refer to as ‘Duration Pruning’ whereby a
segment probability is not calculated if the probability of its duration is below
a pre-determined threshold. Again, this technique works well for phone
recognition experiments on TIMIT but appears to be less useful for speaker detection
experiments on Switchboard.
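
The idea behind ‘Duration Pruning’ can be sketched as a filter applied before any acoustic segment probability is computed. This is an illustration only; the array layout (one row of log duration probabilities per state) and the function name are our assumptions, not the ‘SEGVit’ implementation.

```python
import numpy as np

def surviving_candidates(log_durs, log_threshold):
    """List the (state, duration) pairs whose duration log-probability
    is at or above the threshold; all other pairs are never scored."""
    n_states, t_max = log_durs.shape
    return [(m, d + 1)
            for m in range(n_states)
            for d in range(t_max)
            if log_durs[m, d] >= log_threshold]
```

A decoder would then evaluate segment probabilities only for the returned pairs. Note that with a flat duration distribution every pair has the same duration probability, so a single threshold either keeps all candidates or removes all of them.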

5.7.3  Auckenthaler’s  method  for  reducing  computational  load 

In  a  further  attempt  to  speed  up  our  experiments,  we  investigated  a  technique 
described  by  Auckenthaler  in  his  thesis  (Auckenthaler  2001). 

Auckenthaler  proposes  two  methods  to  reduce  the  computational  load  in  a 
conventional  GMM-based  speaker  detection  system: 
• In Auckenthaler’s first method a ‘bigram’ grammar for sequences of GMM
mixture components is built using mixture component sequences observed
on the training data. For each mixture component m, this bigram grammar
is used to identify the sub-set of N components which are most probable
at time t + 1 if the mth component is most probable at time t. During
recognition the decoder is constrained so that if a particular mixture
component m is used at time t, then only this pre-determined subset of mixture
components is considered at time t + 1.

• The second method exploits the link between the BM and each of the
SDMs. Since each SDM is seeded by the BM, it is argued that there
is a strong connection between the mth component of the BM and the
corresponding mth component of the SDM. Auckenthaler reasoned that if
this is the case, then given a test utterance $Y = [y_1, \ldots, y_N]$ the sequence
of mixture components $m_1, \ldots, m_N$ which is optimal for the BM should
be close to optimal for each SDM. Therefore, once the optimal sequence of
components has been computed for the BM, Auckenthaler uses exactly the
same sequence for each of the SDMs. In essence, this means that for each
acoustic vector $y_t$ only the probability $b_{m_t}(y_t)$ needs to be evaluated and
the remaining M - 1 probabilities $b_m(y_t)$ need not be evaluated. For a 500
component GMM, this avoids 499 of the 500 component evaluations per frame
for each SDM, i.e. roughly a 500-fold reduction in that part of the computation.

In  (Auckenthaler  2001)  the  effects  of  these  techniques  on  detection  performance 
are  documented. 

We developed new software within the ‘SEGVit’ system to implement our
analogy to Auckenthaler’s second scheme. First the optimal state sequence
between a given test utterance and the BM was computed. We then assumed that
the same state sequence is valid for the SDMs, thereby removing the need to do
further Viterbi decoding. We tested this method on the system with Tmax = 10.
The new method reduced the processing time for testing from more
than two weeks to within 3 days, with little loss in system performance. For a
system with Tmax = 15, the verification process can be completed within 5 days,
and the time taken for the whole training and test process decreases from about one
month to only 7 or 8 days.
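
For the frame-based case (Tmax = 1) the scheme can be sketched as follows. The sketch assumes diagonal-covariance components, the ‘max’ approximation of equation (11), and SDMs that share the BM mixture weights; these are our simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def log_gauss(Y, means, variances):
    """Log N(y_t; mu_m, diag(var_m)) for all frames and components: shape (N, M)."""
    diff = Y[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(np.log(2.0 * np.pi * variances)[None]
                         + diff ** 2 / variances[None], axis=2)

def score_with_bm_alignment(Y, log_w, bm_means, bm_vars, sdm_means, sdm_vars):
    """Return (BM log score, SDM log score along the BM's best component sequence)."""
    bm_scores = log_w[None, :] + log_gauss(Y, bm_means, bm_vars)   # (N, M)
    best = bm_scores.argmax(axis=1)                                 # m_1, ..., m_N
    bm_total = bm_scores[np.arange(len(Y)), best].sum()
    # Only one SDM component per frame is evaluated: the one chosen by the BM.
    diff = Y - sdm_means[best]
    var = sdm_vars[best]
    sdm_frame = log_w[best] - 0.5 * np.sum(np.log(2.0 * np.pi * var)
                                           + diff ** 2 / var, axis=1)
    return bm_total, sdm_frame.sum()
```

Because the SDM is scored along a fixed alignment, its score can never exceed the score obtained by decoding the SDM separately; the saving is that the full set of M component evaluations is made once for the BM rather than once per speaker model.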

5.8 Effect of $\lambda_1$ and $\lambda_2$ on segment duration

As described earlier, the Language Model Scale Factor ($\lambda_1$) and Token
Insertion Penalty ($\lambda_2$) can be used to alter the statistics of segment duration. Figure
3 shows the effect of varying the second parameter, $\lambda_2$, on segment duration
statistics. In these experiments $\lambda_1$ was set to 1 while $\lambda_2$ was varied between
-10 and 100. It is important to note that these statistics are obtained from
the test data. The BM and SDMs were trained with Tmax = 10, $\lambda_1 = 1$ and
$\lambda_2 = 0$. Figure 3 shows that the average segment duration for the ‘default’ case
where $\lambda_1 = 1$ and $\lambda_2 = 0$ is 30ms, with a minimum duration of 5ms and a
maximum duration of 70ms. By increasing $\lambda_2$ to 10 the most probable duration is
increased to 80ms. For such large values of $\lambda_2$ it is likely that there is a conflict
between the effect of $\lambda_2$, which is to increase the expected segment duration, and
the hard upper-bound on segment duration imposed by Tmax. Setting $\lambda_2 = -2$
shifts the duration distribution slightly to the left (towards shorter durations),
while setting $\lambda_2 = -10$ causes all segments to have minimum duration, which is
10ms (or one acoustic vector).

Figure 3: Duration length distributions for different values of the Token Insertion
Penalty $\lambda_2$. LM_x_y refers to the case where $\lambda_1 = x$ and $\lambda_2 = y$. LM_1_0 is the
default.

5.9  Results  of  Switchboard  experiments 
5.9.1  Effect  of  the  trajectory  slope  vector 

The results for the first experiment (experiment 1), with model sets SW5_1,
SW5_2 and SW5_3 (footnote 4), are shown as DET curves in Appendix A (figure 5). Recall
that Tmax = 5 in these experiments. The figure shows that the equal error rate for
all three systems is approximately 14%. The best performance is obtained using
speaker-dependent trajectory slopes (scheme 3), but the difference between this
and the other results is very small and unlikely to be significant. The experiment
in which non-zero BM slopes are estimated, and used to seed the speaker-model
slopes but are not subsequently reestimated (scheme 2), gives results which are
almost indistinguishable from the zero-slope result (scheme 1).
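
For reference, an equal error rate can be estimated from lists of target and impostor scores by sweeping the decision threshold. The sketch below is a generic illustration with hypothetical names, not the scoring procedure used to produce the DET curves in this report; it returns the mean of the false-acceptance and false-rejection rates at the threshold where the two are closest.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by testing every observed score as a threshold."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    best_gap, eer = np.inf, None
    for th in thresholds:
        fr = np.mean(target < th)        # true speakers rejected
        fa = np.mean(impostor >= th)     # impostors accepted
        if abs(fa - fr) < best_gap:
            best_gap, eer = abs(fa - fr), 0.5 * (fa + fr)
    return eer
```

A score is accepted when it is at or above the threshold, matching the comparison of L(S) with a pre-determined threshold described in section 5.1.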

Footnote 4: Recall that conditions 1, 2 and 3 correspond to: trajectory slopes set to zero in the BM and
SDMs; BM trajectory slopes learnt but not reestimated in the SDMs; and BM trajectory slopes
learnt and reestimated for the SDMs.

5.9.2 Effect of the maximum segment duration Tmax

The results of the second experiment (experiment 2), for systems with different
maximum durations, namely SW1 (Tmax = 1), SW5 (Tmax = 5) and SW10
(Tmax = 10), are shown in Appendix A (figure 6). The figure shows that the
systems with Tmax = 5 and Tmax = 10 work very slightly better than the system
with Tmax = 1, but the differences are too small to be significant. Recall that
SW1 is our approximation to a conventional GMM.

These results are certainly not as we expected. We expected that in
experiment 1 scheme 1 would give poorer results than schemes 2 and 3, and thereby
demonstrate the utility of modelling dynamics by incorporating a non-zero slope
parameter. In fact this experiment provides little evidence to support the
hypothesis that the use of linear segment models with non-zero trajectory slopes
is beneficial for speaker detection. This result contrasts with the previous result
for YOHO, where there does appear to be a benefit.

In the second set of experiments we expected that SW10, with maximum
segment duration set to 10, would outperform SW5 (Tmax = 5), and that SW5
would in turn outperform SW1 (Tmax = 1). However, there is little evidence in
the results to support this expectation. It should also be noted that the results of
experiments 1 and 2 are consistent. If (as suggested by the results of experiment
1) there is no benefit from using a model based on ‘dynamic’ trajectories with
non-zero slope, then one would not expect to observe any benefit from longer
segments (since a long, constant segment can be modelled just as well by a
sequence of short, constant segments).

We note that all of these results are clearly much worse than the best
performance obtained on the full 2003 test set using a conventional GMM system,
which is a little over 5% equal error rate. This was obtained using a 2048
component GMM system, T-norm and a biologically inspired acoustic
parameterisation. However, the goal of these initial experiments was not to challenge
the state-of-the-art in terms of performance, but to conduct comparative
experiments to determine the benefits of using a dynamic, trajectory-based model.

5.9.3  Effects  of  reducing  the  computational  load 

The  results  obtained  by  applying  the  ‘segmental  GMM’  version  of  Auckenthaler’s 
second  method,  described  in  section  5.7.3,  are  shown  in  the  DET  curves  in 
Appendix  B.  Each  figure  shows  two  DET  curves.  The  dashed  (blue)  line  is  the 
same  in  all  of  the  figures  and  is  included  as  a  baseline.  It  shows  the  DET  curve 
obtained  when  separate  Viterbi  decoding  is  applied  to  each  of  the  SDMs  (i.e. 
Auckenthaler’s method is not used). For these experiments $\lambda_1 = 1$, $\lambda_2 = 0$ and
Tmax = 10.

The solid (red) DET curves show the results of applying Auckenthaler’s
method (i.e. using the optimal state sequence obtained using Viterbi
decoding relative to the BM to calculate the SDM probabilities) together with
different values of the language model control parameters $\lambda_1$ and $\lambda_2$ ($\lambda_1 = 1$;
$\lambda_2 \in \{-10, -2, 0, 2, 5, 15, 50, 100\}$).

Figure 7 shows a direct comparison, for λ1 = 1 and λ2 = 0, of the results
obtained with and without the computational reduction due to Auckenthaler’s
method. The figure shows that the DET curves are almost identical, with the
reduced computation method showing small gains at each extreme of the DET
curve but performing slightly worse towards the centre of the curve. We conclude
that the large reduction in computational load which results from using the
optimal BM state sequence to calculate the SDM probabilities is not compromised
by a significant change in speaker detection performance.
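
The idea behind this reduction can be illustrated with a frame-level stand-in. The sketch below is not the segmental implementation used in these experiments. It assumes one-dimensional Gaussian states with no transition or duration structure, and it assumes that each SDM state corresponds to the BM state with the same index. All names and numbers are hypothetical.

```python
import math

def log_gauss(y, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def best_states(frames, model):
    """Index of the highest-scoring state of `model` for each frame."""
    return [max(range(len(model)), key=lambda k: log_gauss(y, *model[k]))
            for y in frames]

def score_along(frames, model, path):
    """Log-probability of `frames` when the state sequence is fixed to `path`."""
    return sum(log_gauss(y, *model[k]) for y, k in zip(frames, path))

# Each state is a (mean, variance) pair; state k of the SDM is assumed to
# correspond to state k of the BM.
background = [(0.0, 1.0), (3.0, 1.0)]
speaker = [(0.2, 1.0), (3.3, 1.0)]
frames = [0.1, 0.3, 3.2, 3.4, 0.0]

# Decode once against the BM and reuse that state sequence for the SDM.
bm_path = best_states(frames, background)
fast_sdm = score_along(frames, speaker, bm_path)

# Reference: decode the SDM separately (the more expensive route).
full_sdm = score_along(frames, speaker, best_states(frames, speaker))

bm_score = score_along(frames, background, bm_path)
print(fast_sdm - bm_score, full_sdm - bm_score)   # two log-likelihood ratios
```

In this simplified setting the reused-path score can never exceed the separately decoded score, and the two coincide whenever the BM and the SDM agree on the best state sequence.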

Turning now to the effects of varying the Token Insertion Penalty λ2 (figures
8 to 14) we see that there is very little difference between the DET curves
for the different values of λ2, despite the large variation in expected segment
duration shown in figure 3. In particular, it is certainly not the case that (as
one might have expected) performance reaches a maximum for some positive
value of λ2. Indeed, larger values of λ2 lead to decreases in performance, and
the best performance is obtained with λ2 = -2. From figure 3 this value of λ2
corresponds to an expected segment duration of between 20ms and 30ms. It
seems that shorter segment durations give the best performance, which is
quite different from what we expected but consistent with the results for varying
τmax.
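
The link between the Token Insertion Penalty and segment duration can be sketched with a small segmental Viterbi search. This is a toy illustration, not the SEGVit decoder. It assumes constant (zero-slope) one-dimensional segment models, no duration probabilities, and a sign convention in which the penalty is subtracted once per segment, chosen so that larger values favour longer segments as described above. All names and data are hypothetical.

```python
import math

def log_gauss(y, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def best_segmentation(frames, models, penalty, max_len):
    """Segmental Viterbi over constant (zero-slope) segment models.

    models  : list of (mean, variance) pairs, one per segment model
    penalty : subtracted from the log score once per segment (assumed sign)
    max_len : maximum segment duration in frames
    Returns (best total score, list of (start, end) segment boundaries).
    """
    n = len(frames)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            # Best single model for the candidate segment frames[start:end].
            acoustic = max(sum(log_gauss(y, m, v) for y in frames[start:end])
                           for m, v in models)
            score = best[start] + acoustic - penalty
            if score > best[end]:
                best[end], back[end] = score, start
    # Trace the segment boundaries back from the final frame.
    bounds, end = [], n
    while end > 0:
        bounds.append((back[end], end))
        end = back[end]
    return best[n], bounds[::-1]

frames = [0.0, 0.1, 0.0, 3.0, 3.1, 2.9]
models = [(0.0, 1.0), (3.0, 1.0)]

_, long_segments = best_segmentation(frames, models, penalty=5.0, max_len=10)
_, short_segments = best_segmentation(frames, models, penalty=-2.0, max_len=10)
print(len(long_segments), len(short_segments))
```

With the positive penalty the search merges frames into the fewest segments the data allow, while the negative penalty rewards every extra boundary and drives the segmentation towards single-frame segments.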

At this point we noted a possible incompatibility in these experiments. The
language model control parameter λ2 was only varied during testing and not
during training. Therefore its effect on segment duration during testing is
incompatible with the duration models learnt during training. To make the
effect of the language model control parameters compatible with the model
durations, additional experiments were carried out. In these experiments, the
language model control parameter λ2 was the same in model training as in testing
(λ2 ∈ {5, 15, 50}).

The results of these experiments are shown in Appendix C. The DET curves
for the systems which use the optimal BM state sequences when calculating SDM
probabilities are shown with a solid grey (green) line (this is the ‘Auckenthaler
method’). The DET curves for systems which apply Viterbi decoding separately
to the BM and SDMs are shown with a dashed (blue) line (λ1 = 1; λ2 = 0). The
DET curve for a conventional GMM system is shown with a solid, black line.

The results are similar to those in Appendix B. This supports the hypothesis
that the results in Appendix B are not affected significantly by use of different
values of λ2 in training and testing. As in Appendix B, the DET curves in
Appendix C show a trend whereby performance decreases as λ2 (and hence the
average segment duration) increases. The figures confirm, again, that
Auckenthaler’s method has little effect on performance.

6 Visualisation of the ‘segmental GMM’ segment models

The  results  of  our  speaker  detection  experiments  on  Switchboard  are  not  as 
expected.  We  have  been  unable  to  demonstrate  any  benefit  from  the  use  of 
‘dynamic’  segments  based  on  linear  trajectories  with  non-zero  slope.  Hence 
we  have  also  not  been  able  to  demonstrate  any  benefit  from  the  use  of  longer 
segments. This result is at odds with our earlier speaker detection results on
YOHO,  described  in  section  4.2,  and  with  the  phone  recognition  results  presented 
in  (Russell  and  Jackson  2005). 

In order to try to understand this result, we have written a MATLAB program
to  visualise  the  individual  segment  models  in  the  ‘segmental  GMM’.  The  results 
are  illustrated  in  Appendix  D. 

For each segment, we computed linear trajectories for all 19 mel-frequency
cepstral (MFC) coefficients. The length of each trajectory is the average length
of the segment, based on its duration distribution. This results in a sequence
of 19-dimensional MFCC vectors. We then applied an inverse Discrete Cosine
Transform to each of these vectors to obtain a mel frequency spectrum, whose
frequency axis was then warped to obtain a linear frequency spectrum. The
resulting sequence of linear spectral vectors is displayed as a grey-scale
spectrogram to give one of the figures in Appendix D.
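
The steps in this paragraph can be sketched as follows. This is an illustrative Python sketch and not the MATLAB program referred to above. Several details are assumptions, because the report does not state them: 24 mel channels, an HTK-style cosine transform with the zeroth coefficient omitted, an upper frequency of 4 kHz, mel channels spaced evenly between 0 Hz and that limit, and linear interpolation for the frequency warping.

```python
import math

def trajectory(midpoint, slope, length):
    """Linear trajectory: one cepstral vector per frame of the segment."""
    centre = (length - 1) / 2.0
    return [[c + m * (t - centre) for c, m in zip(midpoint, slope)]
            for t in range(length)]

def inverse_dct(cepstrum, n_mel):
    """Truncated inverse cosine transform, coefficients 1..K only (c0 omitted)."""
    scale = math.sqrt(2.0 / n_mel)
    return [scale * sum(c * math.cos(math.pi * (i + 1) * (j + 0.5) / n_mel)
                        for i, c in enumerate(cepstrum))
            for j in range(n_mel)]

def hz_to_mel(f):
    """Common mel-scale formula."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_linear(mel_spectrum, n_bins, f_max):
    """Resample a mel-spaced spectrum onto evenly spaced linear frequencies.

    Simplification: channel 0 is placed at 0 Hz and the last channel at f_max.
    """
    n_mel = len(mel_spectrum)
    mel_max = hz_to_mel(f_max)
    out = []
    for k in range(n_bins):
        f = f_max * k / (n_bins - 1)
        pos = hz_to_mel(f) / mel_max * (n_mel - 1)   # fractional channel index
        lo = min(int(pos), n_mel - 2)
        frac = pos - lo
        out.append((1.0 - frac) * mel_spectrum[lo] + frac * mel_spectrum[lo + 1])
    return out

def segment_spectrogram(midpoint, slope, length,
                        n_mel=24, n_bins=64, f_max=4000.0):
    """Rows are frames, columns are linear-frequency bins."""
    return [mel_to_linear(inverse_dct(frame, n_mel), n_bins, f_max)
            for frame in trajectory(midpoint, slope, length)]

# Hypothetical segment: 19 coefficients, small slope on the first one only.
midpoint = [1.0] + [0.0] * 18
slope = [0.1] + [0.0] * 18
spectrogram = segment_spectrogram(midpoint, slope, length=5)
print(len(spectrogram), len(spectrogram[0]))
```

The resulting rows could then be passed to any image display routine to obtain a grey-scale picture of the segment.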

Visual  inspection  of  these  ‘spectrograms’  suggests  that  they  are  all  valid 
speech  segments,  and  that  they  correspond  to  different  components  of  a  plausible 
segmental model of speech. For example, the second segment in the third row on
the first page of Appendix D is clearly vowel-like, while the segment in position
(5,1) is more fricative-like. The figures show a mixture of stationary and
non-stationary segments.

In summary, visual inspection of these segments does not reveal any obvious
problems, and a method for more detailed analysis is needed.

7  Provision  of  SEGVit  software  toolkit  to  AFRL 

In  addition  to  conducting  the  speaker-detection  experiments  which  are  described 
in  this  report,  we  also  provided  Dr  Timothy  Anderson’s  research  group  at  the 
Air  Force  Research  Laboratory  (AFRL)  at  Wright-Patterson  Air  Force  Base, 
Dayton,  Ohio,  with  a  copy  of  the  SEGVit  toolkit  and  with  guidance  on  how 
to  use  it.  This  was  to  enable  AFRL  to  evaluate  M-SHMMs  for  phone-based 
speaker-detection  on  Switchboard,  using  the  SRI  phone-level  annotations  of  the 
Switchboard  corpus.  To  achieve  this,  various  changes  to  the  SEGVit  software 
were  required,  and  these  were  implemented,  tested  and  sent  to  AFRL. 

8 Conclusions and further work


This report has described the main results obtained during the 12-month
EOARD  project  #033060  “Speaker  verification  using  a  dynamic,  ‘articulatory’ 
segmental  hidden  Markov  model”,  which  started  on  1st  October  2003. 

The  results  of  text-dependent  speaker  verification  experiments  on  the  YOHO 
corpus are presented first. These show a 44% decrease in false acceptance rate for
a segmental HMM-based system, relative to a conventional HMM-based system,
for  the  same  false  rejection  rate.  However,  the  false  rejection  rate  is  too  small 
to  draw  firm  conclusions  about  the  relative  merits  of  the  two  approaches.  Hence 
our  attention  moved  away  from  YOHO  to  the  Switchboard  corpus. 

To  conduct  text-independent  speaker  detection  experiments  on  Switchboard 
we  developed  a  type  of  ‘segmental  GMM’,  which  models  speech  as  a  sequence  of 
outputs  of  a  set  of  linear-trajectory-based  statistical  segment  models.  However, 
due to the requirement to do segmental Viterbi decoding and the need to
compute segment-level probabilities, the computational demands of this model
are prohibitive. We overcame this problem as follows:

• We only conducted experiments on 50% of the NIST 2003 SRE
single-speaker test set

• We developed a parallel version of the ‘SEGVit’ software toolkit, which
enabled training to be spread over a grid of computers

• We incorporated versions of beam pruning and ‘duration pruning’ into the
‘SEGVit’ software

• We successfully extended Auckenthaler’s method, whereby the optimal BM
state sequence is used to compute the SDM probabilities, to our ‘segmental
GMM’

By  combining  these  methods  we  were  able  to  run  a  speaker  detection  experiment 
on  our  reduced  NIST  2003  test  set  in  a  few  days.  For  example,  the  time  taken 
to evaluate a system with maximum segment duration equal to 10 on our 6-node
cluster was reduced from two weeks to three days.

The  main  results  of  our  experiments  on  Switchboard  are  as  follows: 

•  The  techniques  described  above  to  reduce  the  computational  load  were  very 
successful  and  had  no  significant  effect  on  speaker  detection  performance 

• On the Switchboard corpus, we were unable to demonstrate any benefit
from the use of ‘dynamic’ segments based on linear trajectories with
non-zero slope

• Consequently, we were unable to demonstrate any significant benefit from
the use of long segments. Indeed, the best performance was obtained with
segments with an expected duration of between 20ms and 30ms, obtained
by setting the Token Insertion Penalty λ2 to -2.

The discrepancy between the performance of M-SHMMs for text-dependent
detection on YOHO and their performance for text-independent detection on
Switchboard is puzzling. There are at least two possible explanations:

• The experiments on YOHO are text-dependent and use the YOHO
word-level labelling. This labelling enabled us to use phone-level models in
speaker detection. By contrast, no labels were used in the case of
Switchboard and the models were ‘machine learnt’ segment models with no
explicit phonetic interpretation. It could be that some sort of explicit
labelling is needed to guide the segmental model building process. However,
parallel experiments were conducted at AFRL using the ‘SEGVit’ software
toolkit and the automatically-derived SRI Switchboard phone-level labels
to build phone-level trajectory-based M-SHMMs. These models performed
worse than a conventional GMM-based system in tests where both systems
had comparable numbers of parameters. This suggests that the absence of
phone-level labelling in Switchboard is not the answer.

• An alternative explanation is that the discrepancy is due to the different
styles of speech in the YOHO and Switchboard corpora. While YOHO
contains recordings of read speech, Switchboard comprises recordings of
conversational speech over various telephone channels. The poorer quality
of the Switchboard speech might have caused difficulty for the data-driven
segment model learning process, or, alternatively, cues which the segment
models were able to use in the YOHO corpus may be absent in
Switchboard.

To test the hypothesis that the poor quality of the Switchboard data
compromises the data-driven segment model learning process, we developed MATLAB
code to visualise the individual segment models. Inspection of these
representations of the individual segment models in our ‘segmental GMM’ does
not reveal any obvious problems: the segment model set appears to cover a
range of speech-like segments. However, the true quality of the segment models
is difficult to judge, and better visualisation tools are needed. In particular we
need to extend our segment model visualisation tools to enable us to display real
spectrograms of Switchboard data alongside a representation of the spectrogram
corresponding to the optimal sequence of segment models. This should give a
much better understanding of the accuracy of the model.

At present, our main conclusion is as follows. The inclusion of dynamic
segments, corresponding to trajectories with non-zero slope, consistently fails to
improve speaker detection accuracy on Switchboard. This suggests that these
dynamic regions do not contain information which helps to differentiate between
speakers in this corpus. If this is true, it would go some way to explaining the
success of conventional GMM-based approaches to speaker detection on
Switchboard. It is also possible that these dynamic regions are more useful in a
non-conversational corpus like YOHO.

To confirm this hypothesis, we believe that it is important to conduct further
work to determine the exact contribution of dynamic regions of a speech signal
to speaker-detection accuracy. In the context of our current work, we can define
dynamic regions of a speech signal to be those which align with segments with
large slope values in the segmental GMM. By measuring the contribution to
the likelihood ratio of individual segments of a speech signal, we will be
able to measure the relative contributions of static and dynamic segments to the
speaker-detection decision. We propose to apply this analysis to Switchboard, to
test our hypothesis, and to YOHO to see if the contribution of dynamic regions
is more important for a read, and therefore more carefully articulated, corpus.
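
A minimal Python sketch of the proposed analysis is given below. It is only an outline under stated assumptions: each aligned segment is taken to carry its slope vector together with its log-probability under the SDM and the BM, a segment is called dynamic when the Euclidean norm of its slope exceeds a threshold, and the threshold and all data are hypothetical.

```python
import math

def slope_magnitude(slope):
    """Euclidean norm of a segment's slope vector (assumed measure)."""
    return math.sqrt(sum(m * m for m in slope))

def split_llr(segments, threshold):
    """Sum per-segment log-likelihood ratios separately for each class.

    segments : list of (slope_vector, sdm_log_prob, bm_log_prob) tuples
    threshold: slope magnitude above which a segment counts as dynamic
    """
    totals = {"static": 0.0, "dynamic": 0.0}
    for slope, sdm_log_prob, bm_log_prob in segments:
        kind = "dynamic" if slope_magnitude(slope) > threshold else "static"
        totals[kind] += sdm_log_prob - bm_log_prob
    return totals

# Hypothetical aligned segments for one test utterance.
segments = [
    ([0.0, 0.1], -10.0, -12.0),    # nearly flat trajectory
    ([1.5, -2.0], -20.0, -20.5),   # steep trajectory
    ([0.05, 0.0], -8.0, -9.0),     # nearly flat trajectory
]
totals = split_llr(segments, threshold=1.0)
print(totals)
```

Comparing the two totals over a whole test set would indicate how much of the speaker-detection decision is carried by each class of segment.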

APPENDIX A

Results of text-dependent and text-independent speaker verification
experiments.


[DET plots for Figures 4 to 6, each titled ‘Speaker Detection Performance’]

Figure 4: Text-dependent speaker verification results on YOHO using HMMs (dashed
line) and M-SHMMs (solid line).

Figure 5: Speaker verification results on a 50% subset of the NIST 2003 Switchboard
‘one-speaker’ test set, using linear trajectory segmental HMMs with τmax = 5. The
results are for trajectory slopes set to zero in both the BM and the SDMs (scheme 1 -
black line), trajectory slopes reestimated for the BM but not reestimated for the SDMs
(scheme 2 - green line), and reestimated for each of the SDMs (scheme 3 - red line).

Figure 6: Speaker verification results on a 50% subset of the NIST 2003 Switchboard
‘one-speaker’ test set, using linear trajectory segmental HMMs with τmax = 1 (scheme
1 - black solid line), τmax = 5 (scheme 2 - dashed line), τmax = 10 (scheme 3 - green
solid line).

APPENDIX B


Results of experiments to investigate the effect of using the BM optimal state
sequence when computing the SDM probabilities, and different values of λ1 and
λ2.

The DET curves for the systems which use the optimal BM state sequences
when calculating SDM probabilities are shown with a solid line (this is the
‘Auckenthaler method’). The DET curves for systems which apply Viterbi decoding
separately to the BM and SDMs are shown with a dashed line. This line is the
same in all of the figures and corresponds to λ1 = 1; λ2 = 0.


[DET plots for Figures 7 to 14, each titled ‘Speaker Detection Performance’]

Figure 7: λ1 = 1; λ2 = 0.

Figure 8: λ1 = 1; λ2 = 2.

Figure 9: λ1 = 1; λ2 = 5.

Figure 10: λ1 = 1; λ2 = 15.

Figure 11: λ1 = 1; λ2 = 50.

Figure 12: λ1 = 1; λ2 = 100.

Figure 13: λ1 = 1; λ2 = -2.

Figure 14: λ1 = 1; λ2 = -10.

APPENDIX C


Results of experiments to investigate the effect of using the BM optimal state
sequence when computing the SDM probabilities, and different values of λ1 and
λ2. Experiments are as in Appendix B, except that the same values of λ1 and
λ2 are used in training and recognition.

The DET curves for the systems which use the optimal BM state sequences
when calculating SDM probabilities are shown with a solid grey line (this is
the ‘Auckenthaler method’). The DET curves for systems which apply Viterbi
decoding separately to the BM and SDMs are shown with a dashed line (λ1 = 1;
λ2 = 0). The DET curve for a conventional GMM system is shown with a solid,
black line.

[DET plots for Figures 15 to 17, titled ‘Speaker Detection Performance’]

Figure 15: LMScale = 1; LMInsP = 5.

Figure 16: LMScale = 1; LMInsP = 15.

Figure 17: LMScale = 1; LMInsP = 50.

APPENDIX D

Spectrograms corresponding to trained segments from the SDM for female
speaker 5090